Core Example

Stata

Data cleaning and visualization

First, we import the data and clean it. We drop the variables Series, Trades and Deliverable Volume. We also convert the Date variable from string type to date type so that the whole data set can be treated as a time series data set.

Since some stocks changed their names/symbols during this period, we need to fix the inconsistency and merge the split data together.


import delimited NIFTY50_all.csv, clear

* Data Cleaning
gen date2 = date(date, "YMD")
format date2 %tdCCYY-nn-dd
drop date series
drop trades deliverablevolume
rename date2 date
label variable date "Date"

* Replace Symbol Names
replace symbol = "ADANIPORTS" if symbol == "MUNDRAPORT"
replace symbol = "AXISBANK" if symbol == "UTIBANK"
replace symbol = "BAJFINANCE" if symbol == "BAJAUTOFIN"
replace symbol = "BHARTIARTL" if symbol == "BHARTI"
replace symbol = "HEROMOTOCO" if symbol == "HEROHONDA"
replace symbol = "HINDALCO" if symbol == "HINDALC0"
replace symbol = "HINDUNILVR" if symbol == "HINDLEVER"
replace symbol = "INFY" if symbol == "INFOSYSTCH"
replace symbol = "JSWSTEEL" if symbol == "JSWSTL"
replace symbol = "KOTAKBANK" if symbol == "KOTAKMAH"
replace symbol = "TATAMOTORS" if symbol == "TELCO"
replace symbol = "TATASTEEL" if symbol == "TISCO"
replace symbol = "UPL" if symbol == "UNIPHOS"
replace symbol = "VEDL" if symbol == "SESAGOA"
replace symbol = "VEDL" if symbol == "SSLT"
replace symbol = "ZEEL" if symbol == "ZEETELE"

* Save the cleaned data
save NIFTY_clean, replace 
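
A quick sanity check on the merge can be sketched as follows. It assumes the cleaned file `NIFTY_clean.dta` saved above; the four old symbols listed are only a sample of the replacements, and whether any duplicates show up depends on the data.

```stata
use NIFTY_clean, clear

* Old symbols should no longer appear after the replacements
count if inlist(symbol, "MUNDRAPORT", "UTIBANK", "TELCO", "TISCO")

* Each (symbol, date) pair should normally appear only once
duplicates report symbol date
```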

Next, we visualize the data, taking the stock “ADANIPORTS” as an example.

use NIFTY_clean, clear

keep if symbol == "ADANIPORTS"

graph twoway line vwap date, color("blue") xtitle("Days") ///
ytitle("Volume weighted average price")
graph export vwap_date.png, replace
graph twoway line volume date, color("blue") xtitle("Days") ytitle("Volume")
graph export volume_date.png, replace
graph twoway line turnover date, color("blue") xtitle("Days") ytitle("Turnover")
graph export turnover_date.png, replace

[Figures: VWAP, volume and turnover of ADANIPORTS plotted against date]

Determine model parameters

We will use the time series VWAP for the analysis below.

For all stocks, we do Augmented Dickey-Fuller tests to determine whether the time series are stationary or not.

use NIFTY_clean, clear

local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"

foreach sym of local sbls_f5 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    dfuller d1.vwap
}

We run the test on the first difference of the VWAP. All stocks report very small p-values, so the differenced series can be treated as stationary, and we decide to use \(d=1\) for all stocks.

Then, in order to find the AR parameter \(p\) of the model, we generate the partial autocorrelation function (PACF) plots together with the autocorrelation function (ACF) plots. Here, the parameter \(p\) is the number of autoregressive lags of the model: the current value is related to the \(p\) values that precede it. The MA parameter \(q\) plays the same role for the moving-average part: it is the number of lagged error terms included in the model.
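
To read the test results side by side, the p-values can be printed from the results that `dfuller` stores in `r(Zt)` and `r(p)`. The sketch below is an illustration, not the code used to produce the results above. As an assumption of this sketch, it uses a consecutive trading-day counter `t` (a hypothetical helper variable) instead of the calendar date, so that weekends and holidays do not create gaps in the series.

```stata
* Sketch: print the ADF statistic and p-value of the differenced VWAP
local sbls_f5 "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"

foreach sym of local sbls_f5 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    sort date
    gen t = _n                  // hypothetical trading-day index (assumption)
    tsset t
    quietly dfuller d1.vwap
    display "`sym': Z(t) = " %9.3f r(Zt) "   p-value = " %6.4f r(p)
}
```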
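
For reference, the roles of \(p\), \(d\) and \(q\) can be written in a single equation. The notation is an assumption made here for illustration: \(y_t\) is the VWAP series, \(L\) is the lag operator, \(\phi_i\) and \(\theta_j\) are the AR and MA coefficients, \(c\) is a constant and \(\varepsilon_t\) is white noise.

\[
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon_t
\]

With \(d=1\), the factor \((1-L)\,y_t = y_t - y_{t-1}\) is the first difference that was tested for stationarity.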

Note: we will only plot the first 5 stocks as an example.

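use NIFTY_clean, clear

* Re-declare the list of symbols so that this block also runs on its own
local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"

foreach sym of local sbls_f5 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    * Autocorrelation (ACF) plot
    ac vwap
    graph export acf_`sym'.png, replace
    * Partial autocorrelation (PACF) plot
    pac vwap
    graph export pacf_`sym'.png, replace
}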

The PACF plots for these stocks are the following:

[Figures: PACF plots for ADANIPORTS, ASIANPAINT, AXISBANK, BAJAJ-AUTO and BAJAJFINSV]

And the ACF plots for these 5 stocks are the following:

[Figures: ACF plots for ADANIPORTS, ASIANPAINT, AXISBANK, BAJAJ-AUTO and BAJAJFINSV]

We reach a similar conclusion for all of them: lag 1 is clearly significant while lag 2 is not. Hence we choose \(p=1\) for the AR term and \(q=1\) for the MA term for all stocks.

Fit models

Following the process above, we choose \(ARIMA(1, 1, 1)\) (where the first parameter is \(p\), the second is \(d\) and the third is \(q\)) for all stocks. However, diagnostics show that \(ARIMA(1, 1, 0)\) sometimes performs better for some stocks. Hence, for each stock we fit both models, keep the one with the lower AIC, and then plot the predicted values against the original values.
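
Written out, the two candidate models differ only in the moving-average term. The notation is an assumption made here for illustration: \(\Delta y_t = y_t - y_{t-1}\) is the differenced VWAP, \(\phi_1\) and \(\theta_1\) are the AR and MA coefficients, \(c\) is a constant and \(\varepsilon_t\) is white noise.

\[
ARIMA(1,1,1):\quad \Delta y_t = c + \phi_1 \, \Delta y_{t-1} + \varepsilon_t + \theta_1 \, \varepsilon_{t-1}
\]

\[
ARIMA(1,1,0):\quad \Delta y_t = c + \phi_1 \, \Delta y_{t-1} + \varepsilon_t
\]

The comparison uses the Akaike information criterion as reported by estat ic, where \(k\) is the number of estimated parameters and \(\hat{\mathcal{L}}\) is the maximized likelihood; the model with the smaller value is preferred.

\[
AIC = 2k - 2\ln\hat{\mathcal{L}}
\]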

Note: we will only plot the first 5 stocks as an example.

use NIFTY_clean, clear

local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"

foreach sym of local sbls_f5 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    arima vwap, arima(1,1,1)
    estat ic
    mat l_aim = r(S)
    scalar aic_aim = l_aim[1,5]
    arima vwap, arima(1,1,0)
    estat ic
    mat l_ai = r(S)
    scalar aic_ai = l_ai[1,5]
    if aic_aim > aic_ai {
        tsappend, add(200)
        arima vwap, arima(1,1,0)
        predict vwap_pd
        gen vwap_p = vwap_pd + vwap
        replace vwap_p=vwap_p[_n-1]+ vwap_pd[_n] if _n > _N - 200
        graph twoway line vwap date, lwidth("vthin") color("blue") || line ///
        vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png, replace
    } 
    else {
        tsappend, add(200)
        arima vwap, arima(1,1,1)
        predict vwap_pd
        gen vwap_p = vwap_pd + vwap
        replace vwap_p=vwap_p[_n-1]+ vwap_pd[_n] if _n > _N - 200
        graph twoway line vwap date, lwidth("vthin") color("blue") || line ///
        vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png, replace
    }
}

The estimated coefficients are as follows:

[Figures: regression output for ADANIPORTS, ASIANPAINT, AXISBANK, BAJAJ-AUTO and BAJAJFINSV]

The out-of-sample prediction is also implemented here. We try to predict the trend of the stock price over the next 200 trading days, and the sample fitted graphs are:

[Figures: fitted and predicted VWAP for ADANIPORTS, ASIANPAINT, AXISBANK, BAJAJ-AUTO and BAJAJFINSV]
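
As an alternative to accumulating the predicted differences by hand, the forecast can also be requested in levels through the `y` and `dynamic()` options of `predict`. This is a sketch of a different technique from the code above, shown for one stock only. The trading-day counter `t`, the variable `vwap_f` and the output file name are hypothetical names introduced for illustration.

```stata
* Sketch: forecast VWAP levels with predict's y and dynamic() options
use NIFTY_clean, clear
keep if symbol == "ADANIPORTS"
sort date
gen t = _n                      // hypothetical trading-day index (assumption)
tsset t

local last = _N                 // last period with observed data
local first = `last' + 1        // first period to forecast
tsappend, add(200)              // 200 empty periods to forecast into

arima vwap, arima(1,1,1)

* y: predictions in levels of vwap
* dynamic(): use predicted values for lags from the given period onwards
predict vwap_f, y dynamic(`first')

twoway (line vwap t, lcolor(blue)) ///
       (line vwap_f t if t > `last', lcolor(red) lpattern(dash))
graph export forecast_ADANIPORTS.png, replace
```

Because the forecast is dynamic, it relies only on data up to the last observed day and reverts towards the long-run trend of the model as the horizon grows.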

Model improvement

Now that we have chosen different models for different stocks, we can further improve the results by choosing the most suitable model for each stock.

However, Stata does not have a function similar to auto_arima that chooses the model automatically. Hence, for this step we refer to the other two languages (Python, R). Doing the same search in Stata is expected to require heavy and tedious computation.
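
For a small set of candidate orders, the search can still be sketched by hand in Stata. The loop below is an illustration for one stock, not a replacement for auto_arima: it fits every \(ARIMA(p, 1, q)\) with \(p, q \le 2\) and keeps the order with the lowest AIC. The grid limits and the trading-day counter `t` are assumptions of this sketch.

```stata
* Sketch: choose the ARIMA(p,1,q) order with the lowest AIC for one stock
use NIFTY_clean, clear
keep if symbol == "ADANIPORTS"
sort date
gen t = _n                      // hypothetical trading-day index (assumption)
tsset t

local best_aic = .              // missing compares as larger than any number
local best_p = .
local best_q = .

forvalues p = 0/2 {
    forvalues q = 0/2 {
        capture quietly arima vwap, arima(`p',1,`q')
        if _rc continue         // skip orders that fail to fit
        quietly estat ic
        matrix S = r(S)
        local aic = S[1,5]      // column 5 of r(S) holds the AIC
        if `aic' < `best_aic' {
            local best_aic = `aic'
            local best_p = `p'
            local best_q = `q'
        }
    }
}

display "Lowest AIC: ARIMA(`best_p',1,`best_q'), AIC = " %12.2f `best_aic'
```

Wrapping this loop in the foreach over symbols used earlier would repeat the search for every stock, at the cost of nine model fits per stock.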
